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Summary 

A Monte Carlo EM algorithm is considered for the maximum likelihood estimation of multi- 
variate probit models. To sample from truncated multivariate normals we introduce a sequential 
Monte Carlo approach, while to improve the efficiency in driving the sample particles to the 
truncation region Student t distributions are invoked before taking their limit to a normal. After 
the initial sampling, a sequential Monte Carlo step can be performed to shift to new parameter 
values, recycling the samples and so reducing the computational cost. We discuss the identifi- 
ability issue and show that the invariance of the likelihood provides the means to ensure that 
constrained and unconstrained maximization are equivalent. Finally, for the multivariate probit 
model we derive a simple iterative procedure for either maximization which takes effectively no 
computational time. Applying our method to the widely used Six Cities dataset we find parame- 
ters which improve the maximum likelihood compared to other approaches. 

Some key words: Maximum likelihood, Multivariate probit, Monte Carlo EM, adaptive sequential Monte Carlo 
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1. Introduction 

Multivariate probit models, originally introduced by lAshford & SowdenI (119701) for the bi- 
variate case, are particularly useful tools to capture so me of the correlation structure of binary, 
and more generally m ultinomial, response variables (McCulloch, 1994; McCulloch & Rossi, 
1994| : lBock & Gibbons . 1996 : Chib & Greenberg . 1998 : Natarajan et al. . 2000 : Imai & van Dvk . 
20050 . Inference for such models is typically computa tionally involved and o ften still impracti- 
cable in high dimensions. To mitigate these difficulties. IVarin & Czadd(|2010l) recently proposed 
a pseudo-likelihood approach as a surrogate for a f ull likelihoo d analysis. Similar pairwise likeli- 
hood approaches were also previously proposed by lKuk & No tt (2000) and Renard et al. (2004]). 

Du e to the data augrnentat ion nature of the problem, the estimation maximization (EM) algo- 
rithm ((Dempster et al.[ Il977h is typically employed for maximizing the likelihood as its iterative 
procedure is usually more attractive than classical numerical optimization schemes. Each itera- 
tion consists of an estimation (E) step and a maximization (M) step and b oth should ideally be 
easy to implement. For cases in which the E step is analytically intractable, IWei & Tannen (|l990l ) 
introduced a Monte Carlo version of the EM algorithm. Sampling from the truncated normal 
distributions involved is often based on Markov ch ain Monte Carlo (MCMC) methods and the 
Gibbs sampler in particular (see e.g. lGewekelll99lh . As a different option we employ a sequen- 
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tial Monte Carlo (SMC) sampler (IDel Moral et all 12006 ) instead. Though originally introduced 



in dy namical scenarios (|Gordon et all. 119931 : iKitagawaL 119961 : iLiu & Chen . 1998; Doucet et al. 
2001) as a more general alternative to the well know n Kalman filter (|Kalmani. il96 u). SMC al 



gorithms can also be used in static inference (see e.g. IChopinl. l2002r) where artificial dynamics 
are introduced. When the target is a truncated multivariate normal, as in our case, an obvious 
sequence of distributions is obtained by gradually shifting the truncation region to the desired 
position. Since normal distributions decay very quickly in the tails, we propose to use flatter Stu- 
dent t distributions to drive the SMC particles more efficiently towards the end region, and only 
then take the appropriate limit to recover the required truncated multivariate normal. 

The main difficulty in the M step rests with the computational complexity of standard nu- 
merical optimization over large parameter spaces, for which Meng & Rubin (1993) suggested 
a conditional maximization approach. A simple extension of their method allows us to define 
an iterative procedure to further maximize the likelihood at each M step. T hough the likelihood 
converges, there is no guarantee that the parameters converge to a point dWul. 119831) . Restric- 
tions to the parameter space have then been introduced to treat the ident ifiabil i ty issue where the 
data does not determine the parameters uniquely (IMcCuUoch & Rossil 1 1994 : iBock & Gibbonsl . 
19961) . raising the problem of constrained maximization, normally significantly more difficult 
than unconstrained. However, the constraints are necessarily artificial and we show that the two 
maximizations can be made identical, and how they ca n be easily computed. Finally we validate 
our methods by comparison with previous approaches dChib & Greenbergl . ll998l : ICraigll2008h . 



2. Multivariate probit model 
2-1. Notation 
Following the formulation in ^ Chib & GreenbergI (|1998l ). denote by y^ a binary vector cor- 
responding to the jth observation of a response variable Y^ with p components. Let xj be 
a size fc, column vector containing the covariates associated to its ith component and define 
X^ = diag((xj)^, . . . , (xp)^) as a p X k block diagonal matrix, with k = X^iLi ^i- ^ multi- 
variate probit model with parameters /3 G R'^ and T,, ap x p covariance matrix, can be specified 



pr{yJ■=yJ■|X^/3,S} 



where 



Ai 



A.p 



Mz';X^p,^)dz^, Ai 



(0,oo) ify^l, 
(-oo,0]ify^=0, 

■p is the density function of a multivariate normal random variable with mean /i = X^ /3 
and covariance matrix S. The vector of regression coefficient is /? = {pj , . . . , /3J)^, with each 
subvector /3j G M.'^' corresponding to the ith component of the response variable. Naturally the 
situation where the /3j are all identical is a special case. 

The probit model can also be understood in terms of a latent variable construction, where the 
observations are actually obtained from a sample of multivariate Gaussian vectors {z^, . . . , z^} 
from random variables Z ~ M{X^ /3, E) as yj = Iz<o{zj), with / the indicator function. 

The covariance matrix S is a crucial parameter for the multivariate probit model and indi- 
rectly accounts for any dependence among the components of the response variable. The identity 
matrix corresponds to the assumption of independence and the model reduces to a collection 
of one dimensional cases, for which /3 can be easily estimated and used as starting point for 
more elaborate inference str ategies. An alternative starting covariance matrix can be obtained 
(lEmrich & PiedmonteL Il99lb by pairwise approximations, which are lik ely howev er to lead to 
non positive definite matrices. 'Bending' techniques as suggested by Hayes & Hill (1981) are 
then necessary to ensure the positivity of the eigenvalues. 
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2-2. Monte Carlo EM 



An EM algorithm (IDempster et all 119771) allows us to build a sequence j^'"} of estimated 
parameters such that the likelihood is non decreasing. In terms of the complete {Y, Z) and condi- 
tional {Z I y, ^'") missing data distributions for a given estimate ip"^ at iteration m and observed 
data Y, the log-likelihood is 

/(V I Y) = log(pr{y I i,}) = Q{il^,i^n - ^(^, V-"), 

Q(^, V'") = Ez\Y,^^ [log(pr{y, Z I ^])\ , H{ij, V'") = i?z|y,^,™ [log(pr{Z | y, V}))] . 

Having the difference of two logs means that the argument of each is only defined up to the 
same multiplicative factor. Jensen's inequality implies that ff('(/',V'™) ^ H{ilj"^,xp"^), so that 
the likelihood is certainly increased at each step if Q(tl;"^~^^,Tp"^) > Q{ip"^,ip"^), leading to a 
generalized EM. Ideally we wish to set ^"^+1 to the value of tp which maximizes Q{ip, tp"^), as 
required by the actual EM. 

For the multivariate probit model, in terms of the latent variables Z^ ~ J\f{X^ P, S) and letting 
V' = (/?, S) be the parameter vector, the complete data log-likelihood function is 

N 
log(pr{y,Z I i;}) = J] log [lA,{z^)4>{z^-X^f],J:)]. 

Using the cyclicity of the trace and ig noring some normalizing constants, the corresponding 



Q{ip, V'™) function (IChib & Greenbergl . ll998) can be written as 



Qii^^i^n-- 2 



N^ . . iv 



log |S| + tr|s-il f]i?2,,y,,^„ {{Z^ - X^P)iZ^ - X-^/Jj^lj 



(2) 



The second term of ^ is analytically intractable since it involves expectations with respect 
to high dimen sional truncated multivariate Gaussian densities. In a Monte Carlo EM approach 



(|Wei & Tanner^ il990i ) the expectations can be approximated as 

M 



Ez^Y^,i,m {{Z^ - X^I3){Z^ - X^pf} ~ Y, W^^^\Z^'^^'^ - Xip){Zi^^^ - X^pf, (3) 



fc=i 



over a weighted sample {W^^^' , Z^^^'}^L^^, possibly approximated, from t:{z^ \ y^jip"^' 
TMN{A^ ,X^ /3, S), a multivariate normal distribution truncated to the domain A^ . 



3. SMC AND THE E STEP 
3- 1. Sequential Monte Carlo for truncated multivariate normals 



Sequential Monte Carlo samplers (|Del Moral et all l2006r) are a class of iterative algorithms 



to produce weighted sample approximations from a sequence {vr„} of distributions of interest 
where the normalizing constant C„ need not be known, 7r„ = 7n/C„. For a given probability 
distribution vr, one obtains a collection of weighted samples {W'^^\ Z^^)} such that E^^ {h{Z)) ~ 
Ylik=i W^^'h{Z^^>), where M is the number of particles and h a function of interest. In a static 
scenario the main purpose is to obtain such an approximation from the last element of the targeted 
sequence. 



In order to control for the degeneracy of the sample, resampling (see iDouc et all 120051 . for a 



review of resampling schemes) is typically performed when the effective sample size (ESS), as 
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YlkLiiwi'''^)'^, falls below a given threshold ESS* 



defined by |LiiL&Chen|(ll998b: ESS ^ 
rM (with < r < 1). The move from the target 7r„_i to the target 7r„ is achieved by means of a 

, and updating the normalized weights 



(k) / (k) 

transition kernel Kn, so that Z„ ~ i^„(Z^_]^, 






(fc). 






(fc) 



W 7(fc)> 






k = l,...,M. 



The quantit y -Ln_i in the expression for the incremental weights tt;„ is a backward kernel in- 
troduced bylDel Mo ral et al.. (,2006 ) to address computational issues. A typical choice for Kn is 
given by MCMC kernels with 7r„ as an invariant distribution and in particular we adopt a random 
walk Metropolis Hastings kernel. The samples at a given iteration n are obtained by moving each 



(fc) 



particle A: to a new location Zn = Y ~ J\f{Z\^\^ S„) with probability a = 1 A p and leav- 
ing it unchanged otherwise, with p^ = 7r„(y^)/7r„(Z^_]^). The covariance matrix S„ = kStt 

in the random walk proposal is a scaled version of an approximation Syr (typically obtained 
from the previously simulated sample) of the target covariance matrix. As extensively investi- 
gated in the MCMC literature (for example the original paper of lOilks et al.Lll998uHaario et all 
200 ll : lAtchade & Ro senthal 120051 . or the more recent review of lAndrieu & Thomsl. 120081) the 
scaling factor k can be adaptively tuned by monitoring the average empirical acceptance prob- 
ability a„ at iteration n. For the Metropolis Hastings transition kernel, this can be evaluated 

as an = Lfc=i 

within SMC has recently been considered by lJasra et al.l (|201ll ). 



ik') (k) (k) 

Wn (1 A TTn{Yn )/7r„ (Z^_ i )). Adapta t ion of the transition kernel specifically 



3-2. Multivariate normals via Students 
Since the probability of the random walk Metropolis to move towards the tails of a Gaussian 
distribution decreases exponentially, a SMC method involving normals may be highly inefficient 
in moving samples towards regions of low probability. To achieve higher rates of acceptance in 
the tails we suggest starting with a flatter distribution: the multivariate (of dimension p) Student t 
distribution T( if, u, S) with degree of fre edom u, mean vector fi and covariance matrix S, which 
can be defined (JNadarajah & KotzL 120051) as 



fiz) 



r(^) 



r(|)(7rzy)P/2|S|i/2 



i + -iz-i^f^-H 



m) 



2 



(4) 



Replacing the v in the denominator inside the square brackets by {u — 2), and correspondingly 
changing the normalization factor, would provide the Student distribution with a covariance of 
S. As it stands, the distribution in (HJl actually has a covariance of vTi/{v — 2) which further 
increases the acceptance in the tails. Once in the region of low probability we allow the degree 
of freedom to grow to infinity {v — )• oo) so the distribution approaches a p-variate Gaussian with 
the same mean and covariance matrix S. 

To sample in the region of interest A, we define a sequence of target distributions {vr^}^ 
such that the first target is an unconstrained multivariate Student and the last one is the same 
distribution truncated to A. Quite naturally the intermediate distributions are defined in terms 
of intermediate target domains {An}]), included in each other A^+i C Ak, with At = A and 
Aq = W\ The local target 7r„ at iteration n of the SMC algorithm is then 



t ^ ln{z) , . 

T^nKz) = -p^, 7n{z) 



l + -{z-fifT.-^{z-fi) 



2 



IaJz) 
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where Cn is a normalizing constant which can be estimated (IDel Moral et all 120061) from 



Cm — Cl 



n 



Cl 

Ci-^ 






M 

E 

fc=i 



W,(_nu;.(Za,zf) 



and Co follows from ([H). This ultimately allows us to obtain the probabilities of the regions in 
([Hi and hence the likelihood for the probit model. 

After reaching the required region, we define another sequence of target distributions starting 
from the truncated Student and increasing the degree of freedom z/ until it is large enough that 
we can replace the Student with the desired truncated multivariate normal. One could also vary 
both the truncation region and the degree of freedom concurrently in the sequence of target 
distributions, but since the main reason for introducing the flatter Student distribution is to aid 
moving to regions of low probability we chose this two-step approach. 



3-3. Adaptive approach to artificial dynamics 

Other than for tuning the transition kernel Kn, adaptive strategies can also be used to define 
the artificial dynamics leading to the distribution of interest ttt- We do not address the problem of 
finding the opti mal path linking an initia l measure ttq to the target ttt on the space of distributions 
in the sense of iGelman & MengI (| 1998b . who actually deal with this issue in relation to Monte 
Carlo sampling methods for the evaluation of ratios of normalizing constants. Here we assume 
instead that the functional form of the intermediate distribution is given and can be described in 
terms of a parameter 6. An adaptive strategy to move from ttq to -kt is one that does not require 
the sampling points {On} defining the intermediate targets {vr„} to be fixed a priori, but allows 
us to determine them dynamically on the basis of the local difficulty of the problem. 

Adaptation can be achieved by controlling some statistics related to the performance of the 
algorithm and evolving with the parameter 6, and the ESS introduced in subsection 13- II is an 
ideal quantity to monitor. Theoretically we wish to solve 



ESSm( 



ESS*, = 0, 



(5) 



where ESS^ is a value chosen to comprom ise between efficiency and accuracy. Inspired by the 
Robbins-Monro recursion (see for example Kushner & Yinll2003l. page 3) for stochastic approx- 
imation, and aiming at the dynamical design of a sequence which keeps the ESS on average 
close to the threshold ESS^, we define the updating scheme 



-1 + Cn 



M 



VA^ri 



A0T, 



(6) 



where ESSn is the value observed for ESS at iteration n and the division by the number of 
particles M is only introduced for scaling purposes. Taking the maximum between the correction 
term and ^6m\n ensures that the resulting sequence approaches the final target monotonically, 
while taking the minimum with 6t ensures that the sequence ends at the desired target vr^j,. 
Theoretically the ESS should ideally be equal to the total number of particles M of the SMC 
sampler, but to promote motion as a compromise between accuracy and efficiency, the threshold 
ESS^ can be fixed as a fraction r G (0, 1) of M, namely ESS^ = rM. The number of iterations 
needed to reach the target ttt is reduced for smaller r. Simi lar adaptive ideas have also recently 
been applied to inference for stochastic volatility models by lJasra et al.l (120111) . 
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Fig. 1. The number of steps required for the SMC algorithm to reach a region of 
probability r for dimensions 2 (diamonds), 4 (crosses), 8 (dots) and 16 (pluses). 



34. Scaling behaviour 

The advantage of the SMC method, over alternatives which may be more efficient in sampling 
from truncated multivariate normals in low dimensions, is the scaling behaviour with the dimen- 
sion p. Solving the adaptive equation ^ exactly means that we lose a fixed proportion of the 
probability mass at each iteration. The number of steps required to reach a target region of low 
probability r, then behaves like log(r), independently of p. This may not be true when using ^ 
as a numerical adaptive approximation to ([5]), especially as the number of steps for the adaption 
to settle grows linearly with p, so a weak dependence on the dimension could be expected. 

A simulation study with targets of dimensions 2" for n = 1, . . . , 4 was performed. To limit 
the sources of variability, only one covariance structure was considered for the unconstrained 
distribution, with unit diagonals and a single non-zero off-diagonal element of 0-9. The SMC 
algorithm was initialized so that after an initial move the Student t target would be truncated 
to a region containing one quarter of the probability mass of an independent Gaussian, and we 
denote by tq the actual estimated probability. The cutoff for the final target, the same in all 
directions, was drawn so as to ensure that the log probability of a multivariate standard normal 
would be uniform on a given interval. The number of steps needed to reach the target are plotted 
against log(ro/r) in Fig. [H for 400 runs of a SMC sampler with 4000 particles for the different 
dimensions. A behaviour close to linear can be observed, though the offset increases by a factor 
of about 1 -4 over the range of dimensions and the slope increases roughly linearly with p, which 
is likely due to any inexactness in the adaptation. The theoretical stab ility of these types of 
algorithms has recently been investigated in depth by Ifieskos et all (|201 ih . 



3-5. Sequential Monte Carlo EM 
After the initial sampling, which provides a particle approximation from the truncated target 
distribution corresponding to the initial parameter values, a sequential Monte Carlo approach 
can also be adopted to move between subsequent estimates ifj"^ = (/3™, S™) without the need to 
perform the complete truncation again. Multiple sub-steps might be needed to update ^™~^ to 
ij)"^, depending on how different the two corresponding targets are. For each observation j the 
local (to the EM iteration) initial and final distributions of the artificial sequence {7r„} are defined 
as TTo = TMN(^^X-'/3'^-\S™-i) and vrr = TMN(A-'',XJ73™, S™) respectively, while the 
parameter On defining the intermediate targets moves from ifj"^'^ to ijj"^, possibly in a single 
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step. To avoid the situation where we would effectively need to move to a bigger region, which 
w ould prevent us from us ing a simplified version of the backward kernel L„ (see section 3.3.2.3 
of Del Moral et al.Ll2006|) . we first rescale the previous sample to lie in the new truncation region, 



which can be done as long as the scaling factors are all positive, and then we apply the algorithm 
to update to the new covariance matrix. 



4. Cycling conditional maximizations 
4 • 1 . Two step maximization 
To overcome the difficulties associated with numerical maximization, Meng & RubinI (|l993h 



suggested replacing the maximization over the full parameter space by a multi-step conditional 
maximization over several subspaces in turn. They treat the example of multivariate normal re- 
gression with incomplete data, where the parameters ■0™ at step m can again be split into S™ 
and Z?™. This leads to a two-step conditional maximization which can be performed analytically. 
Keeping S fixed and maximizing equations (O and ([3]) over /3 we obtain 

, N \ -1 Af M 



■jr = l ' j = l k=l 



SO that by setting S = S"^ we can update the mean vector parameters for the next step as /SJ""*"^ 
/3. Fixing /3, the S which maximizes equation ([2]) is instead 

N M 

± = —Y,Yl W^'^^\Z^^-^^ - X^^){Z^^^'> - X^(3f, (8) 

j=i k=i 

so that by setting /3 = /3J^p^ we can the update the covariance matrix to S^j^^ = S to give 
the new parameters V'mr • Though this two-step approach does not maximize ijj at each step, 
it removes the need for computationally intensive maximization and increases the likelihood at 
each step to ensure convergence of the (generalized) EM. 

4-2. Further maximization 
Since equation ^ maximizes Q{'iIj, V'"*) over S for any value of /?, we can substitute S into 
Q{ip, V'") iri ® and obtain a function which only depends on /3 

N - Nv 
Q(/?,V™) = -ylog|5]|-^, (9) 

Finding the value /5 which maximizes Q over /3 and setting S = S(/3) in ([H) provides the new 
parameter ifj which maximizes the likelihood. Performing the differential of ^ leads to the 
condition tr{S^^di;} = 0. Though dS is linear in the components of /3, the inverse matrix S~^ 
leads to a system of coupled higher order polynomial equations. Solving these is impracticable, 
but one can proceed iteratively. As a starting point we can choose /3™+^ from the conditional 
maximization of ^ so that Q has the value found from the two-step conditional maximization 
above (i.e. we set ^™+i'" = /3J^p^^ for n = 0). One option would be to perform Newton-Raphson 
iterations, but if the starting point is not too far from the maximum we can employ a simpler 
approximate maximization. Setting 5]»"+i'" = S(/?"^+i'"), we separate S = 5]"^+^'" + AS and 
make the approximation log(l + x) « x to rewrite 

Q(/3,V'") « -— tr|logS'^+i'"| - — tr|(S"^+i'")-iAs| . 
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Maximizing this is solving 



j=i k=i 

where we used the cyclicity of the trace to simpUfy. These are now Unear equations in the com- 
ponents of /3, which can easily be solved to find ^»"+i."+i jn fact the solutions are given 
precisely by © but now evaluated at the point S'^+i'", so that /3™+i'"+i = ^^(S^+i'") and 

j^m+l,n+l __ ^|'om+l,n+l\ j-^ gjyg Jjm+l,n+l 

4-3. From generalized EM to EM 
Neatly, the l ogarit hmic approximation and the two step conditional maximization of 



Meng & RubinI (| 19931 ) are equivalent when started at the same point, (V'™ or V'mr ^^^ exam- 
ple). Because of the approximation, the values of j3 found in this way do not maximize Q but 
can be used as starting points for the next iteration to get closer to the maximum. In general with 
approximations the surety of convergence or even of not decreasing Q is lost, but, due to the 
equivale nce above, eac h iteration does not decrease the likelihood and convergence follows from 
JMeng & RubinI (| 19931 ). To complete the EM algorithm one can set V'™^^ = lim„^oo -0"^+^'", 
and numerically stop the iterations when the Euclidean norm | |^™+i."+i _ ^m+i,n| | j^ small. 
Thou gh we have focu s ed on multivariate normals, cycling through the conditional maximiza- 



tions of IMeng & Rubin (| 19931 ) until convergence can be applied more generally, turning the 



generalized EM of their single round procedure into an EM again. However, as they mention, it 
may be computationally advantageous to perform an E step between conditional maximizations 
when these are more demanding, and then the algorithm remains a generalized one. 



5. IDENTIFIABILITY ISSUE 
5-1. Identifiability 
When the data is 'incomplete' maximization of the likelihood will not lead to uniquely iden- 
tified parameters. Imposing constraints is a standar d measure to ensure identifiability, but often 
with the effect of rn aking the M step more involved (|Bock & GibbonsLll996l : IChan & KukLll997l : 



Kuk & Chanl.l200lb . The issue is directly linked to symmetries of the likelihood, where it is in- 



variant under some change of coordinates of the parameters. Focusing on global symmetries 
where the invariance of the likelihood C{il)) does not depend on the particular value of -0 € ^ 
we can decompose ^ = A x H into an invariant space A and a reduced parameter space H so 
that if) = ((5, ^) with 5 G A and ^ G H. Due to the invariance of the likelihood over A 

C{i)) = C{5,i) = t{i) => max£(?/') = max£(e), 

unconstrained maximization over the whole space ^ is identical to performing it 'constrained' 
over the reduced space H, with the difference that the parameters maximizing the likelihood in 
the larger space are -0* = A x ^*. Conversely, if the likelihood depended on some subspace of 
A then it would be identified during the maximization process. Therefore the dimension of A is 
the number of constraints needed to ensure identifiability. 

In addition to any global symmetries, the likelihood function could also show a local symme- 
try so that £(£) is maximized by a higher dimensional manifold rather than a single point (as 
discussed in Wu, 1983). In principle a local change of variables is possible (for example mak- 
ing the non-zero eigenvalues of the Hessian equal to — 1 around the maximum) to decompose 
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the space further, but in practice this presumes knowledge of the Ukelihood function. As above 
though, maximization over the subspace or the whole space are exactly equivalent because we 
still have (local) dimensions which do not affect the value of the likelihood. 

Within the EM algorithm the identifi ability issue becomes more subtle since the likelihood is 
not maximized directly, but by proxy through the function Q(V', ip"^). If this were to share the 
symmetries of the likelihood, then the simpler unconstrained maximization would be equivalent 
to the constrained version, as for the likelihood. If this is not the case, for example due to con- 
ditioning on the previous pai^ameter value ^'", then any changes in Q arising from shifting ip in 
the invariant space A of the likelihood must be exactly mimicked by changes in H. This spu- 
rious dependence can create differences between constrained and unconstrained maximization. 
The non decreasing behaviour of the likelihood remains preserved, since neither maximization 
decreases Q nor, because of Jensens 's inequality, increases H. Hence either choice leads to the 
EM algorithm finding a maximum of the likelihoo d (though not necess arily the same one) and 
explains the conjecture of jBock & GibbonsI (119961) : IChan & K ulj ((1997) and the agreement be- 



tween constrained and unconstrained maximization found in lKuk & Chan. (2001.) . 



5 • 2. Identifiability for the multivariate probit model 
In the multivariate probit model, the symmetries are related to the invariance of the likeli- 
hood under a rescaling of the coordinates of the normal variables. The full parameter space 
^ comprises p{p + l)/2 entries from the covariance matrix S and k regression coefficients 
from p. Scaling the coordinates Z^ = DU^ by means of a diagonal matrix D with positive 
entries {d\, . . . ,dp), transforms the covariance matrix to i7 = D^^Y^D^^ and the vector /3 to 
A = {d^^fij, . . . ^dp^Pp)"^ but can easily be checked to leave the likelihood unchanged. Choos- 
ing the entries of D to be the square root of the diagonal elements of S reduces $7 to corre- 
lation form. The invariant space A can then be spanned by the p diagonal elements of S (i.e. 
(5i = l/-v/(Tii etc.) while the reduced space H includes the p{p — l)/2 rescaled upper triangular 
elements of Q. (i.e. ujij = diSjaij) and the k elements of A = {SiPj, . . . , SpPp)"^ . 
The likelihood is not maximized directly, but through the function 

^ r r 1 /I 

Q(V', V"^) = ^ j^^ log ^^^ exp (^- 2 (^^^'^ - Xip?^-\zi - X^P) 

xTMN(^^X^73™,S")dz^ (10) 

which is only invariant under a change of integration variables Z^ = DU^ , for a diagonal matrix 
D, if we include a factor \D\ inside the log. Moreover, both tp and tp"^ need to be scaled by same 
matrix so that essentially 6 = 6"^. Although both tp and tp"^ have independent invariant spaces for 
the lik elihood, the Q function ties them together in this apparent constraint. IChib & GreenbergI 
(119981) therefore maximized inside the constrained space H, while keeping (5j = 1. Denote by ip^ 



the parameter value found under such constraints, and by ipu the one obtained through uncon- 
strained maximization of Q. Clearly Q{ipu,ip'^) > Q{ipc, '>P"^), but if we project V'u to a point 
ipp in the constrained space E so that 6i = 1 then Q{ipp, ip"^) < Qiipc, tp™')- Since the likelihood 
is invariant under this projection 

Q{^P^,^Pn-Q(i^p,i^n = H{,p^,i^^)-H{,Pp,^P^), 

and without any information on H{Tpu,ip"^) — H{ipc,'ip^) it is impossible to say which maxi- 
mization increases the likelihood most and is to be preferred in that respect. 
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5-3. Reintroducing invariance 
To remove the above ambiguity, Q can be redefined to respect the invariance of the likelihood, 
for example by replacing (S, /?) in (ITOl ) by their projection (il, A). Such a replacement effectively 
enforces invariance of the resulting function Q with respect to a rescaling of (5],/3), making 
constrained and unconstrained maximization identical. However, this is no longer true if we 
perform a (cyclical) two-step conditional maximization. With the replacement Q becomes 

(11) 





Q(V^,V'") = -y 


i°gpi+tr{^s-iz?5} 


s~ 


N M 


(k) _jj-ixii3)(z^ik) _D-^ 



X^pf, (12) 

with D a diagonal matrix whose elements are the square roots of the diagonal elements of S (so 
that O = D^^HD^^). Though Q may appear to be limited to the constrained space, it depends 
on the full parameter space when one of S or /3 are given. Assume that for given ip"^ and /?'""'" ^ = 
yn+i ^g ^jgj^ jQ ^jjj S"'"'"^. Constrained maximization enforces 6i = 1 to find Q^'^^ and hence 
i/j^~^^. An unconstrained maximization allows 6i to vary, leading to S™+^ and correspondingly 
to V'jr^^. such that (3(^™'''^, "0"^) ^ Qi'PT'^^^ ip"^). Because of the invariance, the projection of 
tp^~^^ does not now change Q resulting in a point in the constrained space with a higher value. 
In fact P is only defined up to a scale, which need not be preserved during each conditional 
maximization, nor given the stochastic nature of the estimation step. 

Fixing S, the value of /? maximizing equations ([TT]) and (IT2l) is as in ([7]l, but with an extra 
factor D before the sum over k. Maximization with fixed /3 over S can in turn be done in two 
steps. The differential dS is split into a diagonal and an off-diagonal part. The condition for the 
latter to vanish is that (il^^ — 17^^50^^) be itself a diagonal matrix. As long as the diagonal 
elements of S are not too far from 1, a solution can be found by a simple iterative approach 
starting from an arbitrary Qq and then solving for the diagonal matrix A the linear equations 

Qk+i = S + QkAQk, (13) 

so that ^k+i is in correlation form. For fixed D the steps above allow us to perform constrained 
maximization for both (fTTI) and ^. If D can vary, for the diagonal elements of dS to vanish 

N M 

A-I + n-^—^^W^^''\Z^ - D~^X^P){Z^f, (14) 

j=l k=l 

must have zero along the diagonal; a linear equation in the inverse elements of D. The solution 
depends on Q, which in turn depends (through S) on D so to perform the unconstrained maxi- 
mization of (fTTI) over S for a given /3 we would need to cycle through solving (fT4l) and (fT3] ). As 
such, the difference between constrained and unconstrained maximization is made transparent. 

5-4. Model constraints 

In practice some constraints might already be imposed at the modelling stage. Typical for 
the multivariate probit model is to require that all the regression vectors are identical: /3j = /3i, 
replacing X^ fi by Xlf3i, with Xl a matrix whose ith row is {xj)'^. The conditional maximization 
steps in Section |4] then allow one to maximize over the constrained space of (S, /3i). 

However, the invariance of the likelihood needs to be reconsidered in light of the new con- 
straints, which are broken when scaling the coordinate directions, and hence the /3j, by different 
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positive factors. The likelihood is now left unchanged, independently of X\ only when rescal- 
ing all the directions by the same amount, corresponding to a one dimensional invariant space. 
A reduced space can be defined by fixing the first diagonal element of the covariance matrix to 
1, call (0', A'^) the corresponding parameters. An invariant Q is obtained by replacing X-',!! and 
fi in (fTOb by Xl, Q' and X'l respectively and by setting all the elements of D in (fTTT) to be the 
square root of the first element of S. Constrained and unconstrained maximization follow from 
subsection 15 -3 1 but with the slight changes that only the first element of the matrix A in (IT3] ) is 
non-zero and just the trace of (fT4l) needs to be 0. 

The effect on the invariance o f assuming equal regressi on coefficients across components 



seems to have been overlooked by IChib & GreenbergI (|l998h as they required Vl' to be in corre 



lation form. Maximizing over an overly constrained space leads in general to a lower likelihood 
than when only imposing the conditions needed to ensure identifiability. Nevertheless, were the 
correlation form desired for modelling reasons, one can perform the maximization by setting D 
to be the identity matrix and using Xi in the formulae in subsection [ 



6. Comparison to existing approaches 

6-1. The data and model used 

To assess the performance of our method, we treat the widely analysed data set from the Six 

Cities longitudinal study on the health effects of air pollution, for which a multivariate probit 

model was considered by Chib & Greenberd (|l998h . who conducted both Bayesian and non- 



I I T 



Bayesian analysis. Later ' Song & L ee (2005) proposed a confirmatory factor analysis for the 



same model. More recently iCraigl (2008) used the example as a test case for his new method 
of evaluating multivariate orthant probabilities. 

The study was meant to model a probabilistic relation over time between the wheezing status 
of children, the smoking habit of their mother during the first year of observation and their age. 
In particular the subset of data considered for analysis refers to the observation of 537 children 
from Stueberville, Ohio. The wheezing condition yl of each child j at age i G {7, 8, 9, 10} and 
the smoking habit h^ of their mother are recorded as binary variables, with value 1 indicating 
the condition (wheezing/smoking) present. Three covariates are assumed for each component i, 
namely the age x^^ = i — 9 of child j centred at 9, the smoking habit x^2 = ^^ ^'I'i ^^i interaction 
term x^g = {i — 9)h^ between the two. A probit model can then be constructed 

pr{yi = 1} = pr{z{ > 0) = $(/3o + /3i • x^^ + /Sa • x{^ + f^s ■ x{^), 

where zj is the i-th component of a multivariate random variable Z^ ~ J\f{Xij3, S) and $ is the 
cumulative distribution function of a standard normal random variable. 

6-2. Testing our algorithm 

To fit the model, a SMC sampler was implemented with the number of particles increasing 

from a starting value of 100 by 100 at each iteration up to 40, followed by 10 further steps of 

variance reduction (described below) with 4000 pa rticles. Results for the con strai ned maximiza - 

tion are presented in Table [U along with those of IChib & GreenbergI (1 1998b and ICraigl (|2008|). 



Good agreement both for the estimates and the standard errors can be observed. Also given are 
average values of the corresponding log-likelihoods, easily obtained as a by-product of the SMC 
samplers, together with the standard deviation estimates over 40 runs. No real diff erences can b e 



seen, with likelihoods comparable to, but slightly below, the estimate of -794-74 in lCraigl (120081) . 
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Table 1 . Maximum likelihood estimates for the six cities dataset as obtained by 
using the constrained SMC algorithm with variance reduction and for a single 
run where the sa mples are recycled. Included for comparison are the results of 
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Chib&Greenbergfl998) 



Po 


-1118 


(65) 


Pi 


-79 


(33) 


P2 


152 


(102) 


Pz 


39 


(52) 


0"i2 


584 


(68) 


(713 


521 


(76) 


(714 


586 


(95) 


(723 


688 


(51) 


(724 


562 


(77) 


(734 


631 


(77) 


m) 


-795-26 


(0-75) 



Craig ( 


2008) 


variance 
-1123 


reduction 
(62) 


recycled 
-1124 


samples 


-1122 


(62) 


(62) 


-78 


(31) 


-78 


(31) 


-79 


(32) 


159 


(101) 


159 


(101) 


159 


(101) 


37 


(51) 


37 


(51) 


37 


(51) 


585 


(66) 


582 


(67) 


582 


(66) 


524 


(72) 


522 


(72) 


523 


(71) 


579 


(74) 


575 


(75) 


572 


(74) 


687 


(56) 


684 


(57) 


683 


(56) 


559 


(74) 


557 


(75) 


554 


(74) 


631 


(67) 


629 


(68) 


629 


(68) 


795-21 


(0-97) 


-795-22 


(0-82) 


-795-30 


(0-91) 



The value in brackets next to each estimate is the estimated standard error. The values of the param- 
eters (and their errors) have all been multiplied by 1000. 

Table 2. Example maximum likelihood estimates for the six cities dataset obtained using the 
unconstrained SMC algorithm for non-invariant Q, invariant Q and by fixing an = 1 





Po 


Pi 


P2 


Pz 


(712 


(713 


(714 


^22 


(723 


(724 


(733 


(734 


(744 


iW 


Q 


-1176 


84 


159 


41 


647 


592 


572 


1208 


855 


619 


1255 


715 


1001 


-793-37 


Q 


-1235 


-113 


168 


47 


664 


622 


612 


1275 


921 


683 


1383 


802 


1146 


-793-15 


fixed (711 


-1241 


-116 


169 


48 


666 


626 


615 


1279 


927 


686 


1395 


809 


1158 


-793-07 



The standard deviations of the log-likelihood estimates are 0-90, 0-75 and 0-70 respectively. The values of the param- 
eters have all been multiplied by 1000. 



Results from recycling the samples in a SMC EM algorithm as in subsection I3-5I with 4000 
particles and 40 iterations are in the last column of Table [T] Since oscillations before the variance 
reduction step were around 0-001 between interations (with 4000 particles), parameter estimates 
when recycling the sample are essentially equivalent, at a much reduced computational cost. 

An additional 20 iterations with 4000 particles are included before the variance reduction step 
for the unconstrained maximization, since it may take longer for the EM algorithm to explore 
a larger space. A fairly robust point is found with the non-invariant Q, while the invariant Q 
seems to lead to a flatter likelihood neighbourhood, with the solution appearing more sensitive to 
the number of particles during earlier iterations or on imposing the constraint of fixing an to 1. 
Results are given in Table |2l and again can be quite closely reproduced by recycling the samples 
in a sequential manner between parameter updates. Unfortunately, noise in the estimation of the 
observed information matrix overly influenced its numerical inversion, so that robust standard 
errors could not be obtained. 



6-3. Variance reduction 
To reduce the variance associated with the stochastic nature of the Monte Carlo E step, the 
parameter can be updated according to a stochastic approximation type rule 



^m ^ ^n^-l ^ ^^(^m _ ^n^-l) ^ (^ _ <^^)^--l ^ ^^^. 
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where V'm is the actual estimate obtained from the M-step and C,m G (0, 1) a stepsize with the 
purpose of gradually shifting the relative importance from the innovation {ip"^ — ip"^"^) to the 
value of the parameter V'm-i learned through the previous iterations. The scheme is like taking 
a weighted average of the previous estimates, so we have referred to it as a 'variation reduction' 
step. This way the monotonicity property of the EM algorithm is not guarateed, but as long 
as the parameter remain within a neighbourhood of the maximum likelihood where it can be 
approximated quadratically, monotonicity follows so that in many practical cases this matter 
may not cause any issues. 
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